HDDS-8971. Example integration with Iceberg, Spark and Trino #5016
Thanks @adoroszlai, I just started exploring this.
A quick question: we are using the tabulario image, which isn't official but comes from a vendor, so we won't have any control over it. Are we OK using that?
Spark/Iceberg may have an official image; can we use that instead?
Hive does have an official Docker image, in case you want to explore: https://hub.docker.com/r/apache/hive
I tried Iceberg with that:
export HIVE_VERSION=4.0.0-alpha-2
docker run -d -p 10000:10000 -p 10002:10002 --env SERVICE_NAME=hiveserver2 --name hive4 apache/hive:${HIVE_VERSION}
docker exec -it hive4 beeline -u 'jdbc:hive2://localhost:10000/'
create table ice01 (id int) stored by iceberg;
show create table ice01;
insert into ice01 values (1),(2),(3),(4);
select * from ice01;
The output of show create table ice01; shows iceberg, which confirms the table is Iceberg. I don't think I saw it mentioned anywhere; maybe they configured some default.
I think you are good with a v1 table, which doesn't support deletes/updates, as in the current example in this PR (https://iceberg.apache.org/spec/#format-versioning).
Switching to v2 is pretty easy as well: it is just a table property.
create table ice02 (id int) stored by iceberg tblproperties ('format-version'='2');
So we can do it in the future as well. :-)
Thanks @ayushtkn for starting to review.
I found this image from Tabular at https://iceberg.apache.org/spark-quickstart/. If there were an official Apache Iceberg image, I guess they would have used that in the example. I'm open to using any other image. BTW, this is just a small experiment to help answer #4973. Spark does have official images; I will explore those.
@SaketaChalamchala please take a look, too.
DESCRIBE iceberg.nyc.taxis;
INSERT INTO iceberg.nyc.taxis VALUES (2, 1000375, 7.2, 555, 'N');
SELECT * FROM iceberg.nyc.taxis;
EOF
Thanks for the patch @adoroszlai. If this is going to be an example of Trino + Iceberg, would it make sense to remove the dependency on Spark and create the table in Trino, like below?
CREATE TABLE IF NOT EXISTS iceberg.nyc.taxis
(
vendor_id bigint,
trip_id bigint,
trip_distance double,
fare_amount double,
store_and_fwd_flag varchar
)
WITH (
format = 'PARQUET',
location = 's3://warehouse/nyc/taxis');
INSERT INTO iceberg.nyc.taxis VALUES (1, 1000371, 1.8, 15.32, 'N'), (2, 1000372, 2.5, 22.15, 'N'), (2, 1000373, 0.9, 9.01, 'N'), (1, 1000374, 8.4, 42.13, 'Y');
Thanks @SaketaChalamchala for the review.
Let's call it an Iceberg, Spark, Trino example instead. :) (I followed the "Spark and Iceberg Quickstart" guide for the Iceberg part.)
There's a code conflict against the latest again.
@jojochuang thanks for taking a look. Conflict has been resolved.
The PR looks good.
But I'm very worried about patent or license issues. Looking at the source code for the tabulario/spark-iceberg Docker image (https://github.com/tabular-io/docker-spark-iceberg/blob/main/docker-compose.yml),
it includes MinIO, and MinIO is AGPL. I want to make sure this is okay.
@jojochuang we don't distribute MinIO in any way. Users running this example download the MinIO Docker image from Docker Hub. But I'm fine abandoning this PR.
What changes were proposed in this pull request?
Create add-on for the ozone docker-compose environment to demonstrate integration with Iceberg and Trino.
https://issues.apache.org/jira/browse/HDDS-8971
How was this patch tested?
Added test script to verify the setup:
- spark-shell (example taken from Spark and Iceberg Quickstart)
- trino
CI:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/5437385105
(Interesting part begins here.)